Avazu User Click-Through Rate (CTR) prediction for ads

  1. create a subset of the data if it is huge and load it efficiently
  2. import data and check for NaNs
  3. explore data: explore all columns and their relations
  4. check for multi-collinearity
  5. perform correlation analysis
  6. check for
In [212]:
import pandas as pd
import plotly.express as px
import plotly.graph_objects as go
from plotly.offline import download_plotlyjs, init_notebook_mode, plot, iplot
import plotly.io as pio
pio.renderers.default = "iframe_connected"
config={'showLink': True, 'displayModeBar': True}
from IPython.display import IFrame, HTML, display

Importing Data

We use ad data provided by Avazu, taken from the Kaggle Click-Through Rate Prediction competition. The original data is ~7GB in size, so we use pandas to import a subset of it: the first 200000 rows.
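A minimal sketch of how such a subset could be read without loading the whole file, assuming the raw CSV is read with pandas. The helper name `load_subset` and the tiny in-memory CSV are hypothetical stand-ins for the real file.

```python
import io
import pandas as pd

def load_subset(source, n_rows):
    """Read only the first n_rows data rows of a CSV (hypothetical helper)."""
    # nrows stops parsing early, so the rest of the file is never loaded.
    # id is read as uint64 because ids like the ones in this dataset
    # are larger than the signed 64-bit maximum.
    return pd.read_csv(source, nrows=n_rows, dtype={"id": "uint64"})

# Tiny in-memory stand-in for the real file, for illustration only
sample_csv = io.StringIO(
    "id,click,hour\n"
    "10015140740686523448,0,14102100\n"
    "10070328440095985756,1,14102100\n"
    "10093977800236804132,1,14102100\n"
)
subset = load_subset(sample_csv, n_rows=2)
print(subset.shape)
```

With the real file, the same call would take the file path and `n_rows=200000`.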

In [213]:
df = pd.read_csv("data/train_subset.csv")
In [214]:
df.head()
Out[214]:
id click hour C1 banner_pos site_id site_domain site_category app_id app_domain ... device_type device_conn_type C14 C15 C16 C17 C18 C19 C20 C21
0 10015140740686523448 0 2014-10-21 00:00:00 1005 0 85f751fd c4e18dd6 50e219e0 c51f82bc d9b5648e ... 1 0 21611 320 50 2480 3 297 100111 61
1 10070328440095985756 1 2014-10-21 00:00:00 1005 0 1fbe01fe f3845767 28905ebd ecad2386 7801e8d9 ... 1 0 15701 320 50 1722 0 35 100084 79
2 10093977800236804132 1 2014-10-21 00:00:00 1005 0 1fbe01fe f3845767 28905ebd ecad2386 7801e8d9 ... 1 0 15704 320 50 1722 0 35 -1 79
3 10104245282042838695 0 2014-10-21 00:00:00 1005 0 1fbe01fe f3845767 28905ebd ecad2386 7801e8d9 ... 1 0 15701 320 50 1722 0 35 100084 79
4 10105971003478261107 0 2014-10-21 00:00:00 1005 0 1fbe01fe f3845767 28905ebd ecad2386 7801e8d9 ... 1 0 15701 320 50 1722 0 35 -1 79

5 rows × 24 columns

Given the above data, we need to understand what each column means.

  1. id: ad identifier
  2. click: 0/1 for non-click/click (the target variable: indicates whether the ad was clicked)
  3. hour: format is YYMMDDHH, so 14091123 means 23:00 on Sept. 11, 2014 UTC.
  4. C1: anonymized categorical variable
  5. banner_pos: position where the ad is displayed
  6. site_id: ID of the website
  7. site_domain: domain where the site is hosted
  8. site_category: category to which the site belongs
  9. app_id: application ID
  10. app_domain: application domain
  11. app_category: application category
  12. device_id: ID of the device on which the ad was shown
  13. device_ip: IP address of the network the device was connected to when the ad was shown (eg: 192.145.86.35)
  14. device_model: model of the device on which the ad was shown
  15. device_type: type of device on which the ad was shown (eg: laptop, desktop, mobile)
  16. device_conn_type: connection type of the device (LAN, wifi, etc)
  17. C14-C21: anonymized categorical variables (although they are anonymous, variables C15 and C16 seem to give the dimensions of the ad on the page in pixels)
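The YYMMDDHH format described for the hour column can be parsed as sketched below. This assumes the raw column holds integers such as 14091123; the subset used in this notebook already stores full timestamps, so the step is only needed for the raw file.

```python
import pandas as pd

# Two example values in YYMMDDHH form (the first is the one from the column description)
raw_hour = pd.Series([14091123, 14102100])

# Convert to strings first so each two-digit field can be matched by the format
parsed = pd.to_datetime(raw_hour.astype(str), format="%y%m%d%H")
print(parsed)
```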
In [215]:
df.shape
Out[215]:
(200000, 24)

Our data has 200000 rows / records and 24 features / columns.

In [216]:
df.dtypes
Out[216]:
id                  uint64
click                int64
hour                object
C1                   int64
banner_pos           int64
site_id             object
site_domain         object
site_category       object
app_id              object
app_domain          object
app_category        object
device_id           object
device_ip           object
device_model        object
device_type          int64
device_conn_type     int64
C14                  int64
C15                  int64
C16                  int64
C17                  int64
C18                  int64
C19                  int64
C20                  int64
C21                  int64
dtype: object

Our features can be broadly classified into the following categories:

  1. site features: site_id, site_category, site_domain
  2. app features: app_id, app_domain, app_category
  3. device features: device_id, device_ip, device_model, device_type, device_conn_type
  4. anonymized categorical features: C1, C14-C21
  5. other features: hour, banner_pos
  6. target variable: click
In [217]:
df["hour"] = pd.to_datetime(df["hour"])

Exploring and Preprocessing Data with Feature Engineering

1. Checking for NaNs and missing values

In [218]:
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 200000 entries, 0 to 199999
Data columns (total 24 columns):
 #   Column            Non-Null Count   Dtype         
---  ------            --------------   -----         
 0   id                200000 non-null  uint64        
 1   click             200000 non-null  int64         
 2   hour              200000 non-null  datetime64[ns]
 3   C1                200000 non-null  int64         
 4   banner_pos        200000 non-null  int64         
 5   site_id           200000 non-null  object        
 6   site_domain       200000 non-null  object        
 7   site_category     200000 non-null  object        
 8   app_id            200000 non-null  object        
 9   app_domain        200000 non-null  object        
 10  app_category      200000 non-null  object        
 11  device_id         200000 non-null  object        
 12  device_ip         200000 non-null  object        
 13  device_model      200000 non-null  object        
 14  device_type       200000 non-null  int64         
 15  device_conn_type  200000 non-null  int64         
 16  C14               200000 non-null  int64         
 17  C15               200000 non-null  int64         
 18  C16               200000 non-null  int64         
 19  C17               200000 non-null  int64         
 20  C18               200000 non-null  int64         
 21  C19               200000 non-null  int64         
 22  C20               200000 non-null  int64         
 23  C21               200000 non-null  int64         
dtypes: datetime64[ns](1), int64(13), object(9), uint64(1)
memory usage: 36.6+ MB

The first thing we usually check is whether there are any NaN values. Every column reports 200000 non-null entries, so our data has no missing values. Had there been any, we could have filled them one column at a time, for example with the mean of a numeric column or the most frequent value of a categorical one, or inferred them from other related columns.
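A minimal sketch of that check and one possible fill strategy, on a toy frame with made-up gaps (the real data has none). The choice of median for a numeric column and mode for a categorical one is an assumption for illustration, not something this dataset required.

```python
import numpy as np
import pandas as pd

# Toy frame with one missing value per column, for illustration only
toy = pd.DataFrame({
    "C15": [320.0, np.nan, 300.0, 320.0],
    "site_category": ["50e219e0", None, "28905ebd", "50e219e0"],
})

# Count missing values per column
missing_per_column = toy.isna().sum()
print(missing_per_column)

filled = toy.copy()
# Numeric column: fill with the median of the observed values
filled["C15"] = filled["C15"].fillna(filled["C15"].median())
# Categorical column: fill with the most frequent observed value
most_frequent = filled["site_category"].mode().iloc[0]
filled["site_category"] = filled["site_category"].fillna(most_frequent)
print(filled)
```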

2. Identify unique values for columns

In [219]:
def count_unique(d, columns):
    for column in columns:
        print(f"Number of Unique values in column {column} is {d[column].nunique(dropna=False)}")
In [220]:
columns = list(df.columns)
count_unique(df, columns)
Number of Unique values in column id is 200000
Number of Unique values in column click is 2
Number of Unique values in column hour is 240
Number of Unique values in column C1 is 7
Number of Unique values in column banner_pos is 7
Number of Unique values in column site_id is 1788
Number of Unique values in column site_domain is 1707
Number of Unique values in column site_category is 21
Number of Unique values in column app_id is 1732
Number of Unique values in column app_domain is 117
Number of Unique values in column app_category is 22
Number of Unique values in column device_id is 33169
Number of Unique values in column device_ip is 143819
Number of Unique values in column device_model is 3756
Number of Unique values in column device_type is 4
Number of Unique values in column device_conn_type is 4
Number of Unique values in column C14 is 1884
Number of Unique values in column C15 is 8
Number of Unique values in column C16 is 9
Number of Unique values in column C17 is 404
Number of Unique values in column C18 is 4
Number of Unique values in column C19 is 66
Number of Unique values in column C20 is 157
Number of Unique values in column C21 is 60

3. Exploring Click-Through Rate (CTR)

Here we compute the Click-Through Rate (CTR) of our dataset. CTR is defined as the ratio of the number of users who click an ad on a particular page to the total number of users who are shown that ad. A higher CTR indicates that more users were interested in the ads we hosted on a particular website / page. Below we observe how many people in our data actually clicked the ad, and then calculate the CTR to estimate how well our ads are performing.

In [232]:
fig = px.histogram(df, x="click")
fig.update_layout(title="Click histogram")
fig.write_html("plots/click_histogram.html")
# plot(fig, filename = 'plots/click_histogram.html')
# display(HTML('plots/click_histogram.html'))
IFrame('plots/click_histogram.html', height=600, width=1000)
Out[232]:
In [188]:
df["click"].value_counts()
Out[188]:
0    165748
1     34252
Name: click, dtype: int64
In [189]:
CTR = len(df[df["click"] == 1]) / len(df)
print("Click-Through Rate (CTR): {}".format(str(CTR)))
Click-Through Rate (CTR): 0.17126

From the above histogram and statistic, we can observe that in 165748 cases users did not click on our ad, and in only 34252 cases they did. Our CTR is ~17%. This means that about 83% of the time the ad is not clicked at all!
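Because click is a 0/1 column, the CTR is simply its mean. The sketch below rebuilds a column with the class counts reported by value_counts() and checks that identity; the function name `click_through_rate` is a hypothetical helper.

```python
import pandas as pd

def click_through_rate(clicks):
    """CTR as the fraction of rows with click == 1 (clicks is a 0/1 Series)."""
    return clicks.mean()

# Rebuild a click column with the class counts reported above
clicks = pd.Series([0] * 165748 + [1] * 34252)
ctr = click_through_rate(clicks)
print(f"CTR: {ctr:.5f}")
```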

4. Effects of Days and Time on Clicks

Here we explore our datetime feature hour and observe how CTR varies across days and times of day. This matters because it gives us insight into what kind of ads to place, and at what time of the day or week the traffic is most promising. For example, online shopping is known to be high on Black Friday and Cyber Monday, which fall in late November every year. Likewise, since most people work during the day, we might assume that more people visit websites or click ads in the evening, after official business hours (5-6 pm). Weekends and geographic location can also significantly affect CTR, as can events such as election periods and festivals, when we might expect people to click on ads more. Hence, a datetime-based analysis is needed to understand the CTR trend.

4.1 Days vs Clicks per hour

In [13]:
df["hour"].describe()
Out[13]:
count                  200000
unique                    240
top       2014-10-22 09:00:00
freq                     2238
first     2014-10-21 00:00:00
last      2014-10-30 23:00:00
Name: hour, dtype: object
In [14]:
click_day = df.groupby('hour').agg({'click':'sum'}).reset_index()
click_day.head()
Out[14]:
hour click
0 2014-10-21 00:00:00 101
1 2014-10-21 01:00:00 133
2 2014-10-21 02:00:00 156
3 2014-10-21 03:00:00 189
4 2014-10-21 04:00:00 182
In [15]:
fig = go.Figure(go.Scatter(name="clicks/day",
    x = click_day['hour'],
    y = click_day['click'],
    hovertemplate='Date: %{x|%d %B %Y} <br>Time: %{x|%H:%M:%S} <br>Day: %{x|%A} <br>Clicks: %{y}'
))

fig.update_layout(
    title = 'Trend of clicks grouped by day for all hours',
    xaxis_tickformat = '%d %B <br>%Y',
    xaxis_title = "Hourly clicks for 10 days",
    yaxis_title = "Number of Clicks"
)

fig.show()

Above we have a plot of the number of clicks made every hour over the 10 days of data, from 21st October 2014 to 30th October 2014. We see peaks in clicks on the 22nd and 28th of October around mid-day, and a surprising dip on the night of the 25th of October. Apart from these three outliers (two peaks and one dip), the hourly click count looks fairly stationary and the trend is almost the same for the rest of the days.

4.2 Hours vs Clicks

Earlier we plotted the clicks made by users in each hour of each day. Now we would like to see how many clicks were made in each hour of the day across all the days: we sum the clicks made in the first hour of every day, then in the second hour of every day, and so on for all 24 hours. The X-axis therefore covers the 24 hours of the day, and the plot shows how clicks vary with the time of day. We engineer a new feature, hour_of_day, to build this plot.

In [16]:
df['hour_of_day'] = df["hour"].dt.strftime("%H:%M")
click_hr = df.groupby('hour_of_day').agg({'click':'sum'}).reset_index()
click_hr.head()
Out[16]:
hour_of_day click
0 00:00 738
1 01:00 948
2 02:00 1063
3 03:00 1228
4 04:00 1463
In [17]:
fig = go.Figure(go.Scatter(name="clicks/hr",
    x = click_hr['hour_of_day'],
    y = click_hr['click'],
    hovertemplate='Time: %{x} <br>Clicks: %{y}'
))

fig.update_layout(
    title = 'Trend of clicks grouped by hours for all the days',
    xaxis_title = "Hours",
    yaxis_title = "Number of Clicks"
)

fig.show()

From the above trend, the highest number of clicks is made between 12:00 pm and 2:00 pm, while fewer clicks are made in the early and late parts of the day. This suggests that people are more active during business hours than at the beginning or end of the day, which is consistent with the mid-day peaks seen in the earlier chart.

4.3 Hourly Impressions based on clicks

Impressions are counted whenever an ad is rendered on a user's screen or any other digital media platform. Impressions are not action-based: they are defined merely by a user potentially seeing the advertisement. Hence, it doesn't matter whether someone clicked the ad or not; every time the ad is displayed counts as an impression, with or without any action taken on it.

In [18]:
df.head()
Out[18]:
id click hour C1 banner_pos site_id site_domain site_category app_id app_domain ... device_conn_type C14 C15 C16 C17 C18 C19 C20 C21 hour_of_day
0 10015140740686523448 0 2014-10-21 1005 0 85f751fd c4e18dd6 50e219e0 c51f82bc d9b5648e ... 0 21611 320 50 2480 3 297 100111 61 00:00
1 10070328440095985756 1 2014-10-21 1005 0 1fbe01fe f3845767 28905ebd ecad2386 7801e8d9 ... 0 15701 320 50 1722 0 35 100084 79 00:00
2 10093977800236804132 1 2014-10-21 1005 0 1fbe01fe f3845767 28905ebd ecad2386 7801e8d9 ... 0 15704 320 50 1722 0 35 -1 79 00:00
3 10104245282042838695 0 2014-10-21 1005 0 1fbe01fe f3845767 28905ebd ecad2386 7801e8d9 ... 0 15701 320 50 1722 0 35 100084 79 00:00
4 10105971003478261107 0 2014-10-21 1005 0 1fbe01fe f3845767 28905ebd ecad2386 7801e8d9 ... 0 15701 320 50 1722 0 35 -1 79 00:00

5 rows × 25 columns

We group our data first by hour_of_day and then by click, which gives us a multi-level grouping. Since we would like one row per hour with clicks and non-clicks as separate columns, we unstack() the result and plot a graph comparing clicks and non-clicks for every hour across all the days.
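To make the two-level grouping and the effect of unstack() concrete, here is a small sketch on made-up rows. The toy values are assumptions for illustration; `fill_value=0` is added so that an hour with no clicks would show 0 rather than NaN.

```python
import pandas as pd

# Five made-up impressions spread over two hours of the day
toy = pd.DataFrame({
    "hour_of_day": ["00:00", "00:00", "00:00", "01:00", "01:00"],
    "click":       [0, 0, 1, 0, 1],
})

# Two-level grouping: one count per (hour_of_day, click) pair
stacked = toy.groupby(["hour_of_day", "click"]).size()
print(stacked)

# unstack() moves the inner level (click) into columns: one row per hour
wide = stacked.unstack(fill_value=0).reset_index()
print(wide)
```

After unstacking, column 0 holds the non-clicks and column 1 the clicks for each hour.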

In [19]:
impressions = df.groupby(['hour_of_day', 'click']).size().unstack().reset_index()
impressions.head()
Out[19]:
click hour_of_day 0 1
0 00:00 3538 738
1 01:00 3952 948
2 02:00 5003 1063
3 03:00 5612 1228
4 04:00 7997 1463
In [20]:
fig = go.Figure(data=[
    go.Bar(name='Clicked', x=impressions["hour_of_day"], y=impressions[1], 
           hovertemplate='Time: %{x} <br>Clicks: %{y}', marker_color='rgb(55, 83, 109)'),
    go.Bar(name='Not Clicked', x=impressions["hour_of_day"], y=impressions[0], 
           hovertemplate='Time: %{x} <br>Not clicked: %{y}', marker_color='rgb(26, 118, 255)')
])
# Change the bar mode
fig.update_layout(
    title = 'Hourly Impressions based on Clicks',
    xaxis_title = "Hour of the day",
    yaxis_title = "Impressions / hr",
    barmode='group',
    )
fig.show()

The above figure shows hourly impressions: in every hour, a large number of people saw the ads, but only a fraction of them actually clicked and were forwarded to a landing page.

4.4 Hourly Click-Through Rate (CTR)

Earlier we saw the hourly clicks made on our ads. Now we would like to observe the hourly Click-Through Rate (CTR). CTR is the number of times the ad was clicked divided by the total number of impressions. We count how many times the ad was clicked in each hour of the day and divide it by the total impressions for that hour, which gives us the hourly CTR as a percentage.

In [21]:
# click is 0/1, so count() gives impressions and sum() gives clicks per hour
hourly_ctr = (df.groupby("hour_of_day")["click"]
                .agg(impressions="count", clicks="sum")
                .reset_index())
hourly_ctr["CTR"] = hourly_ctr["clicks"] / hourly_ctr["impressions"] * 100
hourly_ctr.head()
Out[21]:
hour_of_day impressions clicks CTR
0 00:00 4276 738 17.259121
1 01:00 4900 948 19.346939
2 02:00 6066 1063 17.523904
3 03:00 6840 1228 17.953216
4 04:00 9460 1463 15.465116
In [39]:
fig = px.bar(hourly_ctr, x='hour_of_day', y='CTR',
             labels={"hour_of_day": "Time"}, 
             color='CTR',
             height=400)

fig.update_layout(
    title = 'Hourly CTR',
    xaxis_title = "Hour of the day",
    yaxis_title = "Click-Through Rate (CTR)",
    )
fig.show()

Contrary to the raw click counts, which were highest around mid-day, the CTR values show that users click an ad most often relative to impressions at around 1:00 am, with the second highest peaks at 3:00 pm - 4:00 pm in the afternoon. Both impressions and clicks are low after midnight compared to other times of the day; considering the two together, however, it is an interesting trend that CTR is high in the early hours of the day.
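The difference between ranking hours by raw clicks and ranking them by CTR can be seen on the first five rows of the hourly CTR table. The sketch below copies those values from the output above; it only covers hours 00:00 to 04:00, so it illustrates the effect rather than the full-day result.

```python
import pandas as pd

# Values copied from the hourly CTR table above (hours 00:00 to 04:00 only)
hourly = pd.DataFrame({
    "hour_of_day": ["00:00", "01:00", "02:00", "03:00", "04:00"],
    "impressions": [4276, 4900, 6066, 6840, 9460],
    "clicks":      [738, 948, 1063, 1228, 1463],
})
hourly["CTR"] = hourly["clicks"] / hourly["impressions"] * 100

# The hour with the most clicks is not the hour with the best rate
most_clicks = hourly.loc[hourly["clicks"].idxmax(), "hour_of_day"]
highest_ctr = hourly.loc[hourly["CTR"].idxmax(), "hour_of_day"]
print(most_clicks, highest_ctr)
```

Among these five hours, 04:00 has the most clicks but 01:00 has the highest CTR, because its clicks come from far fewer impressions.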

4.5 Daily CTR

Now that we know how the CTR varies with the hour of the day, let's observe how it varies with the day of the week. We will observe 3 things:

  1. Number of clicks made on each day, i.e. the trend of clicks per day
  2. Number of impressions we had on each day, for both the click and no-click cases
  3. Daily CTR, to observe how the trend is for each day
In [24]:
days = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"]
df["day_of_week"] = df["hour"].dt.day_name()
click_days = df.groupby("day_of_week").agg({"click": "sum"}).reset_index()
click_days['day_of_week'] = pd.Categorical(click_days['day_of_week'], categories=days, ordered=True)
click_days = click_days.sort_values('day_of_week')
click_days.head(7)
Out[24]:
day_of_week click
1 Monday 2922
5 Tuesday 7693
6 Wednesday 7249
4 Thursday 6967
0 Friday 2873
2 Saturday 3079
3 Sunday 3469
In [25]:
fig = go.Figure(go.Scatter(name="clicks/day",
    x = click_days['day_of_week'],
    y = click_days['click'],
    hovertemplate='Day: %{x} <br>Clicks: %{y}',
    marker_color="darkolivegreen"
))

fig.update_layout(
    title = 'Trend of clicks grouped by days',
    xaxis_title = "Day of Week",
    yaxis_title = "Number of Clicks"
)

fig.show()
In [26]:
impressions = df.groupby(['day_of_week', 'click']).size().unstack().reset_index()
impressions['day_of_week'] = pd.Categorical(impressions['day_of_week'], categories=days, ordered=True)
impressions = impressions.sort_values('day_of_week')
impressions.head(7)
Out[26]:
click day_of_week 0 1
1 Monday 13083 2922
5 Tuesday 38951 7693
6 Wednesday 37999 7249
4 Thursday 32918 6967
0 Friday 13579 2873
2 Saturday 13605 3079
3 Sunday 15613 3469
In [27]:
fig = go.Figure(data=[
    go.Bar(name='Clicked', x=impressions["day_of_week"], y=impressions[1], 
           hovertemplate='Day: %{x} <br>Clicks: %{y}', marker_color='indianred'),
    go.Bar(name='Not Clicked', x=impressions["day_of_week"], y=impressions[0], 
           hovertemplate='Day: %{x} <br>Not clicked: %{y}', marker_color='lightsalmon')
])
# Change the bar mode
fig.update_layout(
    title = 'Daily Impressions based on Clicks',
    xaxis_title = "Day of week",
    yaxis_title = "Impressions / day",
    barmode='group',
    )
fig.show()
In [28]:
# click is 0/1, so count() gives impressions and sum() gives clicks per weekday
daily_ctr = (df.groupby("day_of_week")["click"]
               .agg(impressions="count", clicks="sum")
               .reset_index())
daily_ctr["CTR"] = daily_ctr["clicks"] / daily_ctr["impressions"] * 100
daily_ctr['day_of_week'] = pd.Categorical(daily_ctr['day_of_week'], categories=days, ordered=True)
daily_ctr = daily_ctr.sort_values('day_of_week')
daily_ctr.head(7)
Out[28]:
day_of_week impressions clicks CTR
1 Monday 16005 2922 18.256795
5 Tuesday 46644 7693 16.493011
6 Wednesday 45248 7249 16.020598
4 Thursday 39885 6967 17.467720
0 Friday 16452 2873 17.462922
2 Saturday 16684 3079 18.454807
3 Sunday 19082 3469 18.179436
In [29]:
fig = px.bar(daily_ctr, x='day_of_week', y='CTR',
             labels={"day_of_week": "Day"}, 
             color='CTR',
             height=400)

fig.update_layout(
    title = 'Daily CTR',
    xaxis_title = "Day of week",
    yaxis_title = "Click-Through Rate (CTR)",
    )
fig.show()

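Our daily CTR graph shows that the chance of an ad being clicked is highest on Saturday (18.45%), with Monday (18.26%) and Sunday (18.18%) close behind. A higher weekend CTR is reasonable, since on weekends people have more time to spend online, come across ads and click them. Note, however, that the data covers only 10 days (Tuesday 21st to Thursday 30th October), so Tuesday, Wednesday and Thursday each appear twice while the other days appear once. This explains their much higher click and impression counts, and it means each weekday's CTR rests on just one or two days of data.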
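Since the data spans 21st to 30th October 2014, some weekdays occur more often than others, which affects raw daily counts. A minimal sketch, assuming we want clicks per calendar date rather than per weekday total; the click totals are copied from the table above.

```python
import pandas as pd

# Calendar dates covered by the data (21 to 30 October 2014)
days_covered = pd.Series(pd.date_range("2014-10-21", "2014-10-30", freq="D"))
dates_per_weekday = days_covered.dt.day_name().value_counts()
print(dates_per_weekday)

# Click totals per weekday, copied from the table above
clicks = pd.Series({
    "Monday": 2922, "Tuesday": 7693, "Wednesday": 7249, "Thursday": 6967,
    "Friday": 2873, "Saturday": 3079, "Sunday": 3469,
})

# Dividing aligns on the weekday name and gives average clicks per calendar date
clicks_per_date = clicks / dates_per_weekday
print(clicks_per_date)
```

CTR itself is a ratio, so it is not inflated by the repeated weekdays, but the click and impression totals are.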

Now that we have understood how clicks vary with the hour of the day and the day of the week, let us look at the effect of other variables on the target click.

5. Effect of Site Features on Clicks

The kind of website hosting our ads has a huge impact on our clicks. Is the website a famous one, a commercial e-commerce one, or just a blogging site? Many site-related factors play an important role in whether a person will click an ad rendered on it. As calculated earlier, we have 1788 unique websites.

5.1 Effects of Site id on clicks and CTR

In [30]:
print("Number of unique websites: {}".format(str(len(df["site_id"].unique()))))
Number of unique websites: 1788
In [113]:
# top5 websites based on number of ads displayed in them
siteids = df["site_id"].value_counts()[:5].index
site_impressions = df["site_id"].value_counts()[:5].values
print("Top5 websites based on impressions: \n{}".format(siteids))
Top5 websites based on impressions: 
Index(['85f751fd', '1fbe01fe', 'e151e245', 'd9750ee7', '5b08c53b'], dtype='object')
In [114]:
top5_sites = df[(df["site_id"].isin(siteids))]
top5_sites_click = top5_sites.groupby(['site_id', 'click']).size().unstack().reset_index()
top5_sites_click = top5_sites_click.sort_values(by=1, ascending=False).reset_index()
# impressions = non-clicks + clicks, computed per row so each value stays with its own site
top5_sites_click["site_impressions"] = top5_sites_click[0] + top5_sites_click[1]
top5_sites_click = top5_sites_click.rename(columns={0: 'Not Clicked', 1: "Clicked"})
top5_sites_click.columns.name = None
top5_sites_click = top5_sites_click.drop(["index"], axis=1)
top5_sites_click.head()
Out[114]:
site_id Not Clicked Clicked site_impressions
0 85f751fd 63793 8706 72499
1 1fbe01fe 25416 6749 32165
2 e151e245 9218 3901 13119
3 5b08c53b 2330 2123 4453
4 d9750ee7 3361 1380 4741
In [115]:
fig = go.Figure(data=[
    go.Bar(name='Clicked', x=top5_sites_click["site_id"], y=top5_sites_click["Clicked"],
           hovertemplate='Site ID: %{x} <br>Clicks: %{y}', marker_color='seagreen'),
    go.Bar(name='Not Clicked', x=top5_sites_click["site_id"], y=top5_sites_click["Not Clicked"], 
           hovertemplate='Site ID: %{x} <br>Not clicked: %{y}', marker_color='firebrick')
])
# Change the bar mode
fig.update_layout(
    title = 'Top5 Sites based on Clicks',
    xaxis_title = "Top5 Site IDs",
    yaxis_title = "Impressions / site",
    barmode='group',
    )
fig.show()

Of the 1788 sites on which our ads are placed, these are the top 5 in terms of the number of impressions. As before, many people see the ads but only a few of them end up clicking, as shown by the green bars above.

In [116]:
top5_sites_click['CTR'] = top5_sites_click['Clicked'] / top5_sites_click['site_impressions'] * 100
top5_sites_click.head()
Out[116]:
site_id Not Clicked Clicked site_impressions CTR
0 85f751fd 63793 8706 72499 12.008441
1 1fbe01fe 25416 6749 32165 20.982434
2 e151e245 9218 3901 13119 29.735498
3 5b08c53b 2330 2123 4453 47.675724
4 d9750ee7 3361 1380 4741 29.107783
In [117]:
fig = px.bar(top5_sites_click, x='site_id', y='CTR',
             labels={"site_id": "Site Id"}, 
             color='CTR',
             height=400)

fig.update_layout(
    title = 'CTR values of Top5 Sites',
    xaxis_title = "Top5 Site Ids",
    yaxis_title = "Click-Through Rate (CTR)",
    )
fig.show()

We see that although site id 85f751fd had the most impressions, site id 5b08c53b had the highest CTR. It might be that this site has keywords that closely match the ads, and that on clicking the user is directed to an appropriate landing page.
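The same top-N analysis is repeated for several columns, so it can be wrapped in one helper. The sketch below is a hypothetical function `top_n_ctr` (not part of the original notebook) that derives impressions and clicks in a single aggregation, which keeps every value attached to its own group regardless of sort order. The toy data is made up.

```python
import pandas as pd

def top_n_ctr(df, column, n=5):
    """Impressions, clicks and CTR (%) for the n most-shown values of column."""
    stats = (
        df.groupby(column)["click"]
          .agg(impressions="count", clicks="sum")  # click is 0/1
          .nlargest(n, "impressions")              # keep the n most-shown groups
          .reset_index()
    )
    stats["CTR"] = stats["clicks"] / stats["impressions"] * 100
    return stats

# Made-up impressions for three sites
toy = pd.DataFrame({
    "site_id": ["a"] * 4 + ["b"] * 2 + ["c"],
    "click":   [1, 0, 0, 0, 1, 1, 0],
})
result = top_n_ctr(toy, "site_id", n=2)
print(result)
```

On the real data this would be called as `top_n_ctr(df, "site_id")`, and likewise for site_domain or site_category.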

5.2 Effects of Site Domain on Clicks and CTR

In [129]:
print("Number of unique domains: {}".format(str(len(df["site_domain"].unique()))))
Number of unique domains: 1707
In [131]:
# top5 domains based on number of ads displayed in them
sitedomains = df["site_domain"].value_counts()[:5].index
domain_impressions = df["site_domain"].value_counts()[:5].values
print("Top5 site domains based on impressions: \n{}".format(sitedomains))
Top5 site domains based on impressions: 
Index(['c4e18dd6', 'f3845767', '7e091613', '7687a86e', '98572c79'], dtype='object')
In [123]:
top5_domains = df[(df["site_domain"].isin(sitedomains))]
top5_domains_click = top5_domains.groupby(['site_domain', 'click']).size().unstack().reset_index()
top5_domains_click = top5_domains_click.sort_values(by=1, ascending=False).reset_index()
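# impressions = non-clicks + clicks, computed per row so each value stays with its own domain
top5_domains_click["domain_impressions"] = top5_domains_click[0] + top5_domains_click[1]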
top5_domains_click = top5_domains_click.rename(columns={0: 'Not Clicked', 1: "Clicked"})
top5_domains_click.columns.name = None
top5_domains_click = top5_domains_click.drop(["index"], axis=1)
top5_domains_click.head()
Out[123]:
site_domain Not Clicked Clicked domain_impressions
0 c4e18dd6 65812 9307 75119
1 f3845767 25416 6749 32165
2 7e091613 12265 4288 16553
3 7687a86e 3381 2949 6330
4 98572c79 3500 1395 4895
In [124]:
fig = go.Figure(data=[
    go.Bar(name='Clicked', x=top5_domains_click["site_domain"], y=top5_domains_click["Clicked"],
           hovertemplate='Domain ID: %{x} <br>Clicks: %{y}', marker_color='seagreen'),
    go.Bar(name='Not Clicked', x=top5_domains_click["site_domain"], y=top5_domains_click["Not Clicked"], 
           hovertemplate='Domain ID: %{x} <br>Not clicked: %{y}', marker_color='firebrick')
])
# Change the bar mode
fig.update_layout(
    title = 'Top5 Domains based on Clicks',
    xaxis_title = "Top5 Site Domains",
    yaxis_title = "Impressions / domain",
    barmode='group',
    )
fig.show()

Each website is described by a domain. If a domain is descriptive and apt, the chances of people visiting it are higher, although that does not guarantee they will click the ad. If the ad is not relevant to the site's content or core idea, it will have a lower CTR.

In [125]:
top5_domains_click['CTR'] = top5_domains_click['Clicked'] / top5_domains_click['domain_impressions'] * 100
top5_domains_click.head()
Out[125]:
site_domain Not Clicked Clicked domain_impressions CTR
0 c4e18dd6 65812 9307 75119 12.389675
1 f3845767 25416 6749 32165 20.982434
2 7e091613 12265 4288 16553 25.904670
3 7687a86e 3381 2949 6330 46.587678
4 98572c79 3500 1395 4895 28.498468
In [128]:
fig = px.bar(top5_domains_click, x='site_domain', y='CTR',
             labels={"site_domain": "Domain ID"}, 
             color='CTR',
             height=400)

fig.update_layout(
    title = 'CTR values of Top5 Domains',
    xaxis_title = "Top5 Domains",
    yaxis_title = "Click-Through Rate (CTR)"
    )
fig.show()

Again, the 4th domain (7687a86e) has the highest CTR although it had far fewer impressions than the 1st domain (c4e18dd6).

5.3 Effects of Site Category on Clicks and CTR

In [130]:
print("Number of website categories: {}".format(str(len(df["site_category"].unique()))))
Number of website categories: 21
In [134]:
# top5 site categories based on number of ads displayed in them
sitecategories = df["site_category"].value_counts()[:5].index
category_impressions = df["site_category"].value_counts()[:5].values
print("Top5 site categories based on impressions: \n{}".format(sitecategories))
Top5 site categories based on impressions: 
Index(['50e219e0', 'f028772b', '28905ebd', '3e814130', 'f66779e6'], dtype='object')
In [135]:
top5_categories = df[(df["site_category"].isin(sitecategories))]
top5_categories_click = top5_categories.groupby(['site_category', 'click']).size().unstack().reset_index()
top5_categories_click = top5_categories_click.sort_values(by=1, ascending=False).reset_index()
# impressions = non-clicks + clicks, computed per row so each value stays with its own category
top5_categories_click["category_impressions"] = top5_categories_click[0] + top5_categories_click[1]
top5_categories_click = top5_categories_click.rename(columns={0: 'Not Clicked', 1: "Clicked"})
top5_categories_click.columns.name = None
top5_categories_click = top5_categories_click.drop(["index"], axis=1)
top5_categories_click.head()
Out[135]:
site_category Not Clicked Clicked category_impressions
0 f028772b 51392 11330 62722
1 50e219e0 71442 10622 82064
2 28905ebd 28778 7714 36492
3 3e814130 10608 4264 14872
4 f66779e6 1111 50 1161
In [138]:
fig = go.Figure(data=[
    go.Bar(name='Clicked', x=top5_categories_click["site_category"], y=top5_categories_click["Clicked"],
           hovertemplate='Category ID: %{x} <br>Clicks: %{y}', marker_color='seagreen'),
    go.Bar(name='Not Clicked', x=top5_categories_click["site_category"], y=top5_categories_click["Not Clicked"], 
           hovertemplate='Category ID: %{x} <br>Not clicked: %{y}', marker_color='firebrick')
])
# Change the bar mode
fig.update_layout(
    title = 'Top5 Categories based on Clicks',
    xaxis_title = "Top5 Site Categories",
    yaxis_title = "Impressions / site category",
    barmode='group',
    )
fig.show()

Sites can belong to various categories: e-commerce websites, healthcare websites, education websites, etc. Each category has various websites of different domains. The above graph shows how impressions vary by site category. For instance, the 2nd site category (50e219e0) has the highest impressions. It might represent e-commerce sites like Amazon or eBay, which have a larger footprint than less visited websites such as a hospital website or an educational blog catered to a specific audience.

In [139]:
top5_categories_click['CTR'] = top5_categories_click['Clicked'] / top5_categories_click['category_impressions'] * 100
top5_categories_click.head()
Out[139]:
site_category Not Clicked Clicked category_impressions CTR
0 f028772b 51392 11330 62722 18.063837
1 50e219e0 71442 10622 82064 12.943556
2 28905ebd 28778 7714 36492 21.138880
3 3e814130 10608 4264 14872 28.671329
4 f66779e6 1111 50 1161 4.306632
In [140]:
fig = px.bar(top5_categories_click, x='site_category', y='CTR',
             labels={"site_category": "Site Category ID"}, 
             color='CTR',
             height=400)

fig.update_layout(
    title = 'CTR values of Top5 Site Categories',
    xaxis_title = "Top5 Site Categories",
    yaxis_title = "Click-Through Rate (CTR)"
    )
fig.show()

As before, the CTR is highest for the 4th site category (3e814130), although its impressions are lower.

6. Effect of Device Features on Clicks

6.1 Effect of Device id on Clicks

6.2 Effect of Device Type on Clicks

7. Effect of App Features on Clicks

7.1 Effect of App Category on Clicks

In [ ]: